CM4125 Week 7: Time Series Data + Running Bar Charts

Lecture objectives

1) Understand the concept of time series data and what does it imply

2) Process time series datasets into effective dataframes

3) Produce running bar charts using a variety of tools

What is Time Series Data (TSD)?

Simply put, datasets that describe/contain data which varies over time

They are usually characterised by a fixed and constant aggregation of data

Readings from sensors in the industry
Coronavirus cases
Trends in economics

The art of using these datasets resides in knowing how to shape them and using the right features

To practice, we will use the dataset called bitcoin.csv

This data set contains the price of the Bitcoin cryptocurrency on the US-based exchange Gemini, minute-by-minute for all of 2017

In [2]:
import pandas as pd
import numpy as np
bitcoin = pd.read_csv('https://www.dropbox.com/s/kxjc92oia6gcoy8/bitcoin.csv?raw=1', index_col=0)
bitcoin
Out[2]:
Open High Low Close Volume
Date
2017-01-01 00:00:00 974.55 974.55 974.55 974.55 0.000000
2017-01-01 00:01:00 974.55 974.55 974.55 974.55 0.000000
2017-01-01 00:02:00 974.55 974.55 970.00 970.00 0.417679
2017-01-01 00:03:00 970.00 970.00 970.00 970.00 0.000514
2017-01-01 00:04:00 970.00 970.00 970.00 970.00 0.000000
... ... ... ... ... ...
2017-12-31 23:55:00 13782.87 13782.87 13771.00 13771.00 1.657980
2017-12-31 23:56:00 13771.00 13775.00 13770.95 13775.00 7.538428
2017-12-31 23:57:00 13775.00 13815.37 13775.00 13815.37 18.437304
2017-12-31 23:58:00 13815.37 13825.00 13815.37 13825.00 3.547593
2017-12-31 23:59:00 13825.00 13825.00 13804.68 13820.26 2.610719

525600 rows × 5 columns

Note that the Date column contains entries formatted as a date-time such as 2017-01-01 00:00:00

First, let's check the type of this data by examining the first item

In [3]:
first_index_value = bitcoin.index[0]
print(first_index_value)
print(type(first_index_value))
2017-01-01 00:00:00
<class 'str'>

We can avoid this in one go by parsing the dates when loading

In [4]:
bitcoin = pd.read_csv('https://www.dropbox.com/s/kxjc92oia6gcoy8/bitcoin.csv?raw=1', 
                      parse_dates=['Date'], index_col=0)
bitcoin
Out[4]:
Open High Low Close Volume
Date
2017-01-01 00:00:00 974.55 974.55 974.55 974.55 0.000000
2017-01-01 00:01:00 974.55 974.55 974.55 974.55 0.000000
2017-01-01 00:02:00 974.55 974.55 970.00 970.00 0.417679
2017-01-01 00:03:00 970.00 970.00 970.00 970.00 0.000514
2017-01-01 00:04:00 970.00 970.00 970.00 970.00 0.000000
... ... ... ... ... ...
2017-12-31 23:55:00 13782.87 13782.87 13771.00 13771.00 1.657980
2017-12-31 23:56:00 13771.00 13775.00 13770.95 13775.00 7.538428
2017-12-31 23:57:00 13775.00 13815.37 13775.00 13815.37 18.437304
2017-12-31 23:58:00 13815.37 13825.00 13815.37 13825.00 3.547593
2017-12-31 23:59:00 13825.00 13825.00 13804.68 13820.26 2.610719

525600 rows × 5 columns

If we check the column now, it is marked as Timestamp

In [5]:
first_index_value = bitcoin.index[0]
print(first_index_value)
print(type(first_index_value))
2017-01-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Exploring TSD

This data set contains the following columns:

Price Data:

  • Open - The price of the commodity (bitcoin) at the start of the time period (minute)
  • High - The highest value the price reached during the time period
  • Low - The lowest value the price reached during the time period
  • Close - The price at the end of the time period

Trade Data:

  • Volume - The total amount of the commodity bought/sold in the time period

Other time series data sets with have other columns, but this is common for financial data.

Even though the index is stored as Timestamp, we can retrieve rows by querying with a string

For example during the minute 2017-05-27 19:30:00, we can query using 2017-05-27 19:30:00 which matches the data

However, since the data set has only minute-by-minute data, the seconds don't matter, so we can exclude the seconds in our query and search for 2017-05-27 19:30

In [6]:
bitcoin.loc['2017-05-27 19:30']
Out[6]:
Open      2144.68000
High      2144.68000
Low       2141.26000
Close     2142.53000
Volume       0.05406
Name: 2017-05-27 19:30:00, dtype: float64

The price started this minute at 2144.68 USD, dropped to 2142.53 USD at the end of the minute

During this time the minimum value was 2141.26 USD and the maximum value was 2144.68 USD

0.05406 bitcoin was bought/sold during this minute

Let's check the next minute 2017-05-27 19:31

In [7]:
bitcoin.loc['2017-05-27 19:31']
Out[7]:
Open      2142.5300
High      2142.5300
Low       2141.2700
Close     2141.2700
Volume       0.2069
Name: 2017-05-27 19:31:00, dtype: float64

The previous minute closed at 2142.53 USD and this minute opened at 2142.53 USD

This is expected since the end of one minute is the same as the start of the next minute!

If we want to check the open price for the day and close price for the day, we could check the open price for the first minute and close price for the last minute

In [8]:
print('Date: 2017-05-27')
print('Open:', bitcoin.at['2017-05-27 00:00', 'Open'])
print('Close:', bitcoin.at['2017-05-27 23:59', 'Close'])
Date: 2017-05-27
Open: 2267.37
Close: 2096.77

We can select an interval of time the same way we do with numbers

With the following code, we select all of the values in May of 2017

In [9]:
bitcoin[(bitcoin.index >= '2017-05') & (bitcoin.index < '2017-06')]
Out[9]:
Open High Low Close Volume
Date
2017-05-01 00:00:00 1362.63 1362.63 1359.05 1359.05 0.129530
2017-05-01 00:01:00 1359.05 1359.05 1359.05 1359.05 0.015945
2017-05-01 00:02:00 1359.05 1359.05 1359.05 1359.05 0.430274
2017-05-01 00:03:00 1359.05 1359.05 1359.05 1359.05 0.024931
2017-05-01 00:04:00 1359.05 1359.05 1359.05 1359.05 0.000000
... ... ... ... ... ...
2017-05-31 23:55:00 2301.80 2301.80 2301.80 2301.80 0.000000
2017-05-31 23:56:00 2301.80 2301.80 2301.80 2301.80 0.065392
2017-05-31 23:57:00 2301.80 2301.80 2301.80 2301.80 0.000000
2017-05-31 23:58:00 2301.80 2301.80 2301.80 2301.80 0.000000
2017-05-31 23:59:00 2301.80 2303.62 2301.80 2303.62 13.587863

44640 rows × 5 columns

Selecting all of the values for 1st May before noon

In [10]:
bitcoin[(bitcoin.index >= '2017-05-01') & (bitcoin.index < '2017-05-01 12:00')]
Out[10]:
Open High Low Close Volume
Date
2017-05-01 00:00:00 1362.63 1362.63 1359.05 1359.05 0.129530
2017-05-01 00:01:00 1359.05 1359.05 1359.05 1359.05 0.015945
2017-05-01 00:02:00 1359.05 1359.05 1359.05 1359.05 0.430274
2017-05-01 00:03:00 1359.05 1359.05 1359.05 1359.05 0.024931
2017-05-01 00:04:00 1359.05 1359.05 1359.05 1359.05 0.000000
... ... ... ... ... ...
2017-05-01 11:55:00 1376.43 1376.43 1376.40 1376.43 0.064194
2017-05-01 11:56:00 1376.43 1377.36 1376.43 1377.36 0.013068
2017-05-01 11:57:00 1377.36 1377.36 1377.35 1377.35 0.023829
2017-05-01 11:58:00 1377.35 1378.25 1377.35 1378.24 17.260743
2017-05-01 11:59:00 1378.24 1378.24 1378.24 1378.24 0.000000

720 rows × 5 columns

Calculations with TSD

We may want to calculate other values such as taking the middle value between the Low and High values, this will allow us to calculate an estimate for the value of the bitcoin traded

It will only be an estimate. since we don't have the exact list of trades available in this dataset!

We can perform calculations on the columns in exactly the same we we have with any other data set before

In [11]:
bitcoin['Middle'] = (bitcoin['High'] + bitcoin['Low']) / 2
bitcoin['USD Volume'] = bitcoin['Middle'] * bitcoin['Volume']
bitcoin
Out[11]:
Open High Low Close Volume Middle USD Volume
Date
2017-01-01 00:00:00 974.55 974.55 974.55 974.55 0.000000 974.550 0.000000
2017-01-01 00:01:00 974.55 974.55 974.55 974.55 0.000000 974.550 0.000000
2017-01-01 00:02:00 974.55 974.55 970.00 970.00 0.417679 972.275 406.099132
2017-01-01 00:03:00 970.00 970.00 970.00 970.00 0.000514 970.000 0.498755
2017-01-01 00:04:00 970.00 970.00 970.00 970.00 0.000000 970.000 0.000000
... ... ... ... ... ... ... ...
2017-12-31 23:55:00 13782.87 13782.87 13771.00 13771.00 1.657980 13776.935 22841.878972
2017-12-31 23:56:00 13771.00 13775.00 13770.95 13775.00 7.538428 13772.975 103826.583413
2017-12-31 23:57:00 13775.00 13815.37 13775.00 13815.37 18.437304 13795.185 254346.017926
2017-12-31 23:58:00 13815.37 13825.00 13815.37 13825.00 3.547593 13820.185 49028.397646
2017-12-31 23:59:00 13825.00 13825.00 13804.68 13820.26 2.610719 13814.840 36066.660297

525600 rows × 7 columns

Resampling TSD

Sometimes we are interested in sampling the in larger chunks than it is presented

For instance, this data has one line per minute; we may want data for days instead

The method to resample the data in Python is called resample, and takes a time period as an argument

For example, 1D means 1 day (you can see more options here)

In [17]:
# Not very useful output right?
bitcoin.resample('1D')
Out[17]:
<pandas.core.resample.DatetimeIndexResampler object at 0x00000185A60DD6A0>

If we call the mean method, it will average out the values of this new object by day

In [18]:
bitcoin.resample('1D').mean()
Out[18]:
Open High Low Close Volume Middle USD Volume
Date
2017-01-01 980.776378 980.898135 980.663396 980.793059 0.743065 980.780766 737.218302
2017-01-02 1016.424132 1016.618785 1016.247319 1016.439174 1.006385 1016.433052 1025.061363
2017-01-03 1023.624340 1023.743431 1023.514653 1023.634618 1.596420 1023.629042 1636.278683
2017-01-04 1082.933979 1083.422104 1082.550431 1083.009069 2.258822 1082.986267 2464.408073
2017-01-05 1056.542243 1057.577083 1055.411271 1056.446188 4.190673 1056.494177 4165.407749
... ... ... ... ... ... ... ...
2017-12-27 15639.538896 15655.889035 15621.917583 15639.252847 6.427652 15638.903309 99412.047834
2017-12-28 14261.058215 14285.193507 14237.371021 14260.412146 7.639668 14261.282264 108742.624129
2017-12-29 14535.871354 14549.550931 14522.446875 14535.931340 4.438627 14535.998903 64443.408129
2017-12-30 13476.331806 13494.028972 13454.185951 13475.012840 6.065731 13474.107462 80474.783127
2017-12-31 13274.262319 13285.760785 13262.565792 13275.057167 5.678043 13274.163288 75170.569715

365 rows × 7 columns

The method takes the values in the column Volume and rows 2017-01-01 00:00 to 2017-01-01 23:59 and averaging them

We can also get the total by adding the values

In [19]:
bitcoin.resample('1D').sum()
Out[19]:
Open High Low Close Volume Middle USD Volume
Date
2017-01-01 1.412318e+06 1.412493e+06 1412155.29 1.412342e+06 1070.013119 1.412324e+06 1.061594e+06
2017-01-02 1.463651e+06 1.463931e+06 1463396.14 1.463672e+06 1449.193752 1.463664e+06 1.476088e+06
2017-01-03 1.474019e+06 1.474191e+06 1473861.10 1.474034e+06 2298.844903 1.474026e+06 2.356241e+06
2017-01-04 1.559425e+06 1.560128e+06 1558872.62 1.559533e+06 3252.704150 1.559500e+06 3.548748e+06
2017-01-05 1.521421e+06 1.522911e+06 1519792.23 1.521283e+06 6034.569411 1.521352e+06 5.998187e+06
... ... ... ... ... ... ... ...
2017-12-27 2.252094e+07 2.254448e+07 22495561.32 2.252052e+07 9255.819566 2.252002e+07 1.431533e+08
2017-12-28 2.053592e+07 2.057068e+07 20501814.27 2.053499e+07 11001.121472 2.053625e+07 1.565894e+08
2017-12-29 2.093165e+07 2.095135e+07 20912323.50 2.093174e+07 6391.623337 2.093184e+07 9.279851e+07
2017-12-30 1.940592e+07 1.943140e+07 19374027.77 1.940402e+07 8734.652933 1.940271e+07 1.158837e+08
2017-12-31 1.911494e+07 1.913150e+07 19098094.74 1.911608e+07 8176.382068 1.911480e+07 1.082456e+08

365 rows × 7 columns

This operation was more useful for the Volume column, as it now sums up all of the entries for that day

We can see that 1070.013119 bitcoin (about $1,061,594 at the time) were traded on 01/01

In fact there was another way to get this!

Multipying the bitcoin per minute rate from before (0.743065) times the number of minute in a day (1440)
In [20]:
1440*0.743065
Out[20]:
1070.0136

Resampling TSD by Column

If we want the lowest or highest price for the whole day, we don't need an average or a sum!

We can tell Pandas to use max on these values, and also we can use min for Low

In [25]:
bitcoin.resample('1D').max()
Out[25]:
Open High Low Close Volume Middle USD Volume
Date
2017-01-01 1003.54 1003.54 1003.54 1003.54 110.126689 1003.540 1.102911e+05
2017-01-02 1035.89 1035.90 1035.88 1035.89 116.627861 1035.885 1.197103e+05
2017-01-03 1038.26 1038.26 1038.26 1038.26 405.000039 1038.260 4.150481e+05
2017-01-04 1150.00 1155.54 1149.97 1150.00 167.000000 1149.980 1.839179e+05
2017-01-05 1167.16 1167.16 1167.16 1167.16 918.000000 1167.160 8.822990e+05
... ... ... ... ... ... ... ...
2017-12-27 16500.00 16500.00 16499.58 16500.00 115.322939 16499.790 1.704437e+06
2017-12-28 15495.96 15499.99 15493.84 15495.96 143.885194 15494.420 2.005490e+06
2017-12-29 15077.92 15115.10 15060.75 15077.92 172.666517 15087.925 2.448388e+06
2017-12-30 14607.75 14607.77 14577.60 14607.75 168.027243 14592.685 2.205184e+06
2017-12-31 14180.00 14191.20 14163.09 14180.00 106.797334 14171.545 1.435948e+06

365 rows × 7 columns

In [26]:
bitcoin.resample('1D').min()
Out[26]:
Open High Low Close Volume Middle USD Volume
Date
2017-01-01 965.00 965.00 965.00 965.00 0.0 965.000 0.0
2017-01-02 995.00 995.01 995.00 995.00 0.0 995.010 0.0
2017-01-03 1014.34 1014.35 1014.06 1014.34 0.0 1014.345 0.0
2017-01-04 1035.03 1035.03 1035.02 1035.03 0.0 1035.030 0.0
2017-01-05 879.00 899.87 879.00 879.00 0.0 892.365 0.0
... ... ... ... ... ... ... ...
2017-12-27 14541.60 14599.62 14521.00 14541.60 0.0 14560.500 0.0
2017-12-28 13532.44 13579.80 13514.81 13532.44 0.0 13556.120 0.0
2017-12-29 13962.47 14000.02 13949.50 13962.47 0.0 13983.010 0.0
2017-12-30 12481.44 12508.19 12481.44 12481.44 0.0 12504.095 0.0
2017-12-31 12540.25 12555.11 12511.00 12540.25 0.0 12540.055 0.0

365 rows × 7 columns

To generate a new dataframe with both changes applied to the respective column in one line of code, we will use the resample, lambda and apply methods:

  • For High we will max of the values of High for each minute.

  • For Low we will use min.

  • For Volume and USD Volume we will use sum.

  • For Open we will use the first item (index [0]).

  • For Close we will use the last item (index [-1]).

In [22]:
daily = bitcoin.resample('1D').apply({'Open'       : lambda x: x[0],
                                      'High'       : np.max,
                                      'Low'        : np.min,
                                      'Close'      : lambda x: x[-1],
                                      'Volume'     : np.sum,
                                      'USD Volume' : np.sum})
daily
Out[22]:
Open High Low Close Volume USD Volume
Date
2017-01-01 974.55 1003.54 965.00 998.57 1070.013119 1.061594e+06
2017-01-02 998.57 1035.90 995.00 1020.23 1449.193752 1.476088e+06
2017-01-03 1020.23 1038.26 1014.06 1035.03 2298.844903 2.356241e+06
2017-01-04 1035.03 1155.54 1035.02 1143.16 3252.704150 3.548748e+06
2017-01-05 1143.16 1167.16 879.00 1004.84 6034.569411 5.998187e+06
... ... ... ... ... ... ...
2017-12-27 15830.86 16500.00 14521.00 15418.95 9255.819566 1.431533e+08
2017-12-28 15418.95 15499.99 13514.81 14488.61 11001.121472 1.565894e+08
2017-12-29 14488.61 15115.10 13949.50 14574.99 6391.623337 9.279851e+07
2017-12-30 14574.99 14607.77 12481.44 12675.68 8734.652933 1.158837e+08
2017-12-31 12675.68 14191.20 12511.00 13820.26 8176.382068 1.082456e+08

365 rows × 6 columns

Notice that the Close of one day still matches the Open of the next

Moreover, we have deliberately left out Middle since we want to recalculate it on a daily basis, as it will be different from what we calculate on a minute basis!

We will leave the USD Volume based on minute data since that will give a more accurate estimate and this column is not exact since we don't have the trades list

Let's recalculate Middle on this new daily data

In [23]:
daily['Middle'] = (daily['High'] + daily['Low']) / 2

daily
Out[23]:
Open High Low Close Volume USD Volume Middle
Date
2017-01-01 974.55 1003.54 965.00 998.57 1070.013119 1.061594e+06 984.270
2017-01-02 998.57 1035.90 995.00 1020.23 1449.193752 1.476088e+06 1015.450
2017-01-03 1020.23 1038.26 1014.06 1035.03 2298.844903 2.356241e+06 1026.160
2017-01-04 1035.03 1155.54 1035.02 1143.16 3252.704150 3.548748e+06 1095.280
2017-01-05 1143.16 1167.16 879.00 1004.84 6034.569411 5.998187e+06 1023.080
... ... ... ... ... ... ... ...
2017-12-27 15830.86 16500.00 14521.00 15418.95 9255.819566 1.431533e+08 15510.500
2017-12-28 15418.95 15499.99 13514.81 14488.61 11001.121472 1.565894e+08 14507.400
2017-12-29 14488.61 15115.10 13949.50 14574.99 6391.623337 9.279851e+07 14532.300
2017-12-30 14574.99 14607.77 12481.44 12675.68 8734.652933 1.158837e+08 13544.605
2017-12-31 12675.68 14191.20 12511.00 13820.26 8176.382068 1.082456e+08 13351.100

365 rows × 7 columns

Plotting TSD

Typically, the first plot used for this type of data is a line plot

In [24]:
import plotly.express as px
fig = px.line(daily.reset_index(), x='Date', y='Middle')
fig.show()

However, if we had the data of one more cryptocurrency (yes, there's more than bitcoin!), we could compare their trends and see which one is trading higher!

And that is why people came up with... RUNNING BAR CHARTS (RBC)!

Fig 1.

Let's see how this example was produced using Python

We will use a file called city_populations.txt which you can download from here

This file contains population data of different world cities through time

In [32]:
df = pd.read_csv('https://gist.githubusercontent.com/johnburnmurdoch/4199dbe55095c3e13de8d5b2e5e5307a/raw/fa018b25c24b7b5f47fd0568937ff6c04e384786/city_populations')
df
Out[32]:
name group year value subGroup city_id lastValue lat lon
0 Agra India 1575 200.0 India Agra - India 200.0 27.18333 78.01667
1 Agra India 1576 212.0 India Agra - India 200.0 27.18333 78.01667
2 Agra India 1577 224.0 India Agra - India 212.0 27.18333 78.01667
3 Agra India 1578 236.0 India Agra - India 224.0 27.18333 78.01667
4 Agra India 1579 248.0 India Agra - India 236.0 27.18333 78.01667
... ... ... ... ... ... ... ... ... ...
6247 Vijayanagar India 1561 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6248 Vijayanagar India 1562 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6249 Vijayanagar India 1563 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6250 Vijayanagar India 1564 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6251 Vijayanagar India 1565 480.0 India Vijayanagar - India 480.0 15.33500 76.46200

6252 rows × 9 columns

Let's clean this a little bit...

In [33]:
# These are the only columns I need
df = df[['name','group','year','value']]
df
Out[33]:
name group year value
0 Agra India 1575 200.0
1 Agra India 1576 212.0
2 Agra India 1577 224.0
3 Agra India 1578 236.0
4 Agra India 1579 248.0
... ... ... ... ...
6247 Vijayanagar India 1561 480.0
6248 Vijayanagar India 1562 480.0
6249 Vijayanagar India 1563 480.0
6250 Vijayanagar India 1564 480.0
6251 Vijayanagar India 1565 480.0

6252 rows × 4 columns

We will change the names of the columns to something more coherent!

In [34]:
df = df.rename(columns={"name": "City", "group": "Continent","year": "Year", "value":"Population"})
df
Out[34]:
City Continent Year Population
0 Agra India 1575 200.0
1 Agra India 1576 212.0
2 Agra India 1577 224.0
3 Agra India 1578 236.0
4 Agra India 1579 248.0
... ... ... ... ...
6247 Vijayanagar India 1561 480.0
6248 Vijayanagar India 1562 480.0
6249 Vijayanagar India 1563 480.0
6250 Vijayanagar India 1564 480.0
6251 Vijayanagar India 1565 480.0

6252 rows × 4 columns

Notice that in the Continent column we have India for some cities

If I'm correct, India is a sub-continent rather than a continent!

Let's see the continents listed here...

In [36]:
set(df['Continent'])
Out[36]:
{'Asia', 'Europe', 'India', 'Latin America', 'Middle East', 'North America'}

Middle East and Latin America are also listed as continents!

In [38]:
# Cities in the "Indian continent"
print(set(df[df['Continent']=='India']['City']))
{'Vijayanagar', 'Lucknow', 'Mumbai', 'Ahmedabad', 'Bijapur', 'Kolkatta', 'Cuttack', 'Gauda', 'Delhi', 'Agra'}
In [40]:
# Cities with "Middle East" continent
print(set(df[df['Continent']=='Middle East']['City']))
{'Tabriz', 'Esfahan', 'Mashhad', 'Cairo', 'Fez'}
In [41]:
# Cities with "Middle East" continent
print(set(df[df['Continent']=='Latin America']['City']))
{'Rio de Janeiro', 'Mexico City', 'Sao Paulo', 'Buenos Aires'}

Let's first replace India -> Asia

In [42]:
df['Continent'] = df['Continent'].replace('India', 'Asia')

With the other two is slightly different, as these categories contain cities from different continents!

In fact I may need to change "one by one"

In [45]:
df['Continent'][df['City']=='Fez']='Africa'
df['Continent'][df['City']=='Cairo']='Africa'
df['Continent'] = df['Continent'].replace('Middle East', 'Asia')
df['Continent'][df['City']=='Mexico City']='North America'
df['Continent'] = df['Continent'].replace('Latin America', 'South America')
df
C:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Anaconda\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[45]:
City Continent Year Population
0 Agra Asia 1575 200.0
1 Agra Asia 1576 212.0
2 Agra Asia 1577 224.0
3 Agra Asia 1578 236.0
4 Agra Asia 1579 248.0
... ... ... ... ...
6247 Vijayanagar Asia 1561 480.0
6248 Vijayanagar Asia 1562 480.0
6249 Vijayanagar Asia 1563 480.0
6250 Vijayanagar Asia 1564 480.0
6251 Vijayanagar Asia 1565 480.0

6252 rows × 4 columns

It shows a warning, but it works!

In [47]:
# Show that the relacement has been successful
df[df['City']=='Mexico City']
Out[47]:
City Continent Year Population
3508 Mexico City North America 1952 3736.6
3509 Mexico City North America 1953 3922.4
3510 Mexico City North America 1954 4108.2
3511 Mexico City North America 1955 4294.0
3512 Mexico City North America 1956 4531.0
... ... ... ... ...
3572 Mexico City North America 2016 21172.8
3573 Mexico City North America 2017 21346.6
3574 Mexico City North America 2018 21520.4
3575 Mexico City North America 2019 21694.2
3576 Mexico City North America 2020 21868.0

69 rows × 4 columns

I will save the new dataset locally

In [48]:
df.to_csv(r'city_populations_updated.csv',index=False)

Now, let's try to do our own RBC with Python and the matplotlib package

In [49]:
# Importing the necessary packages
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

If we just plot city vs population using ax.barh, we will see the last value of population for each of the cities

In [52]:
fig, ax = plt.subplots(figsize=(15, 8))
ax.barh(df['City'], df['Population'])
Out[52]:
<BarContainer object of 6252 artists>

To do a running bar chart, we need to define a given year and sort the cities from most to least populated (for that year)

In [55]:
current_year = 1500
df2 = (df[df['Year'].eq(current_year)]
       .sort_values(by='Population',ascending=False))
df2
Out[55]:
City Continent Year Population
276 Beijing Asia 1500 672.0
6186 Vijayanagar Asia 1500 500.0
1055 Cairo Africa 1500 400.0
1791 Hangzhou Asia 1500 250.0
5595 Tabriz Asia 1500 250.0
1503 Gauda Asia 1500 200.0
2142 Istanbul Europe 1500 200.0
4682 Paris Europe 1500 185.0
1537 Guangzhou Asia 1500 150.0
3751 Nanjing Asia 1500 147.0
1289 Cuttack Asia 1500 140.0
1496 Fez Africa 1500 130.0

Notice that in this dataset, we don't have all cities for each year so we would get different charts

In [56]:
# Plotting a basic bar chart
fig, ax = plt.subplots(figsize=(15, 8))  
ax.barh(df2['City'], df2['Population'])
Out[56]:
<BarContainer object of 12 artists>

To add colours and labels (for each continent), we will use a dictionary that pairs cities with their respective continent

In [57]:
colours = dict(zip(
    ['Europe', 'Asia', 'South America', 'North America', 'Africa'],
    ['#adb0ff', '#90d595','#aafbff', '#f7bb5f', '#eafb50']))
group_lk = df.set_index('City')['Continent'].to_dict()
group_lk
Out[57]:
{'Agra': 'Asia',
 'Ahmedabad': 'Asia',
 'Amsterdam': 'Europe',
 'Ayutthaya': 'Asia',
 'Beijing': 'Asia',
 'Berlin': 'Europe',
 'Bijapur': 'Asia',
 'Birmingham': 'Europe',
 'Boston': 'North America',
 'Buenos Aires': 'South America',
 'Cairo': 'Africa',
 'Chicago': 'North America',
 'Cuttack': 'Asia',
 'Delhi': 'Asia',
 'Dhaka': 'Asia',
 'Edirne': 'Europe',
 'Esfahan': 'Asia',
 'Fez': 'Africa',
 'Gauda': 'Asia',
 'Guangzhou': 'Asia',
 'Hangzhou': 'Asia',
 'Istanbul': 'Europe',
 'Karachi': 'Asia',
 'Kolkatta': 'Asia',
 'Kyoto': 'Asia',
 'Lisbon': 'Europe',
 'London': 'Europe',
 'Los Angeles': 'North America',
 'Lucknow': 'Asia',
 'Manchester': 'Europe',
 'Mashhad': 'Asia',
 'Mexico City': 'North America',
 'Moscow': 'Europe',
 'Mumbai': 'Asia',
 'Nanjing': 'Asia',
 'Naples': 'Europe',
 'New York': 'North America',
 'Osaka': 'Asia',
 'Paris': 'Europe',
 'Pegu': 'Asia',
 'Philadelphia': 'North America',
 'Rio de Janeiro': 'South America',
 'Saint Petersburg': 'Europe',
 'Sao Paulo': 'South America',
 'Seoul': 'Asia',
 'Shanghai': 'Asia',
 'Suzhou': 'Asia',
 'Tabriz': 'Asia',
 'Tokyo': 'Asia',
 'Venice': 'Europe',
 'Vienna': 'Europe',
 'Vijayanagar': 'Asia'}

Now we can use this pairing to get some colours in df2

In [60]:
fig, ax = plt.subplots(figsize=(15, 8))
df2 = df2[::-1]   # flip values from top to bottom
# pass colours values to `color=`
ax.barh(df2['City'], df2['Population'], color=[colours[group_lk[x]] for x in df2['City']])
# iterate over the values to plot labels and values (Tokyo, Asia, 38194.2)
for i, (value, name) in enumerate(zip(df2['Population'], df2['City'])):
    ax.text(value, i,     name,            ha='right')
    ax.text(value, i-.25, group_lk[name],  ha='right')
    ax.text(value, i,     value,           ha='left')
# Add year right middle portion of canvas
ax.text(1, 0.4, current_year, transform=ax.transAxes, size=46, ha='right')
Out[60]:
Text(1, 0.4, '1500')

This function will draw a more "stylish" function to create the last chart

In [61]:
# Frankly, I don't know what half of it does...
def draw_barchart(year):
    dff = df[df['Year'].eq(year)].sort_values(by='Population', ascending=True).tail(10)
    ax.clear()
    ax.barh(dff['City'], dff['Population'], color=[colours[group_lk[x]] for x in dff['City']])
    dx = dff['Population'].max() / 200
    for i, (value, name) in enumerate(zip(dff['Population'], dff['City'])):
        ax.text(value-dx, i,     name,           size=14, weight=600, ha='right', va='bottom')
        ax.text(value-dx, i-.25, group_lk[name], size=10, color='#444444', ha='right', va='baseline')
        ax.text(value+dx, i,     f'{value:,.0f}',  size=14, ha='left',  va='center')
    # ... polished styles
    ax.text(1, 0.4, year, transform=ax.transAxes, color='#777777', size=46, ha='right', weight=800)
    ax.text(0, 1.06, 'Population (thousands)', transform=ax.transAxes, size=12, color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=12)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='major', axis='x', linestyle='-')
    ax.set_axisbelow(True)
    ax.text(0, 1.12, 'The most populous cities in the world ',
            transform=ax.transAxes, size=24, weight=600, ha='left')
    plt.box(False)
In [63]:
# Now we execute the function
fig, ax = plt.subplots(figsize=(15, 8))
draw_barchart(1988)

The most important part is to animate the plot for a range of years

In [65]:
import matplotlib.animation as animation
from IPython.display import HTML
fig, ax = plt.subplots(figsize=(15, 8))
animator = animation.FuncAnimation(fig, draw_barchart, frames=range(2000,2020))
HTML(animator.to_jshtml()) 
Out[65]:


Once Loop Reflect

Any easier way to do this?

Besides the large amount of code, one of the biggest letdown of this is the lack of animation for the bars to "pass" each other!

Therefore,...

Lab

There are two parts, one for TSD and another for RBC

The first one consists on following the instructions to treat TSD data

The second one has the code to do the RBC

For that one, I want you to get creative!

  • Can you change colours, styles, names, etc.?
  • Can you skip some years?
  • Can you get rid of the static plot that gets printed after the animation?
  • Can you show not only the top 10 cities?
  • Can you show all cities at the same time?
  • Can you put the years of the range in the title of the chart?
  • Can you run it from 1500 to 2020?
  • Can you make the chart go back in time?

Alternatively you can download the dataset from Moodle and see if other tools are able to these things!